Statistics of selectively neutral genetic variation 
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Random models of evolution are instrumental in extracting rates of microscopic evolutionary 
mechanisms from empirical observations on genetic variation in genome sequences. In this context 
it is necessary to know the statistical properties of empirical observables (such as the local homozy- 
gosity for instance). Previous work relies on numerical results or assumes Gaussian approximations 
for the corresponding distributions. In this paper we give an analytical derivation of the statistical 
properties of the local homozygosity and other empirical observables assuming selective neutrality. 
We find that such distributions can be very non-Gaussian. 



For more than thirty years, microscopic random mod- 
els of genetic evolution have been the focus of a sub- 
stantial research effort in theoretical biology In 
the future, such microscopic models and their statistical 
analysis will be of yet increasing significance in this field: 
the amount of accurate and comprehensive data on the 
genetics of viruses, bacteria and especially the human 
genome |^-|^ has increased so considerably that it is now 
possible to test microscopic models of genetic evolution. 

Genetic information is encoded in the linear sequence 
of nucleotides in DNA molecules; the four different nu- 
cleotides occurring in DNA are usually denoted by A, C, 
G and T. A sequence of a few hundred or a few thousand 
of these forms a gene, also referred to as a locus. Mu- 
tations change individual nucleotides (e.g. from A to C) 
and thus create modified versions of loci. The resulting 
different types of loci are also known as allelic types. Be- 
cause loci consist of many nucleotides - each of which can 
be changed by mutation independently from the others 
- the number of possible allelic types is typically very 
large. To a good approximation it can thus be assumed 
that every mutation creates a new allelic type. This is 
the defining feature of the infinite- alleles model 0|. 

Empirically, genetic variation is recorded by measuring 
the frequencies LOa^ of each allelic type a at each locus 
I. Genetic variation reflects the microscopic processes of 
evolutionary dynamics. The simplest model of evolution 
proceeds by sampling with replacement each generation 
from the previous generation (at constant population size 
A'^). In addition, a number of microscopic processes take 
place, each happening at a constant (but generally un- 
known) rate. One such process is mutation, measured 
as 9 = 2Nfi where fi is the probability of mutation per 
locus per generation (in a haploid population). Another 
such process is the exchange of genetic material between 
individuals of a population measured as C = 2Nc where 
c is the probability of an exchange event per locus per 
generation js). C is termed recombination rate. 

The model of genetic evolution described here is called 
the constant-rate neutral mutation process, referred to as 



neutral process in the following. It is a stochastic model 
and assumes that no selective forces act. The neutral 
process is one of the most significant microscopic models 
of genetic evolution: not only does it provide a model 
for genetic variation at loci unaffected by selection, devi- 
ations between empirical observations and predictions of 
this neutral process allow for a qualitative characterisa- 
tion of selective effects (see p). 

There is by now an overwhelming amount of work, 
both theoretical and empirical, on the neutral process for 
the infinite alleles model. A convenient way of simulat- 
ing this process on a computer is to consider genealo- 
gies of samples of a given population |10 in the 
limit of N —^ oo. Random samples are most effectively 
generated by creating random genealogies. In this way, 
statistics of empirical observables may be obtained using 
Monte-Carlo simulations. Another possibility is to sim- 
ulate Ewen's sampling formula which determines the 
statistics of the neutral process in the limit of large C. 
Analytical work has mostly focused on calculating expec- 
tation values and variances of empirical observables [ p2[ . 
Distributions of even the simplest empirical observables 
(such as the one-locus homozygosity |^]) are not known 
analytically. The difficulty is: moments of empirical ob- 
servables are usually calculated by expanding them into 
a sum of identity coefficients . This procedure is im- 
practical for high moments. 

At the same time, the form of such distributions is of 
great interest: for example, they characterise sample-to- 
sample fluctuations. More importantly, they can be used 
to establish confidence intervals for empirical observa- 
tions. To date, such confidence intervals have routinely 
been obtained from Monte-Carlo calculations . Al- 

ternatively it has been assumed that the distributions are 
well approximated by Gaussians [|l5|. 

The aim of this paper is to calculate distributions of 
empirical observables (such as the homozygosity) in the 
neutral process for the infinite-alleles model. The remain- 
der is organised as follows: first the results for a single 
locus are described, and then those for two and more loci. 
Finally, implications of the results are discussed. 
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One locus. Consider the homozygosity F2, the prob- 
abihty that a pair of alleles (in a sample of size n with 
m allelic types) has the same allelic type. In terms of 
the allelic frequencies this probability can be expressed 
as (large n) 



Since the sum on the r.h.s. does not depend on z™, it 
can be averaged independently from z^. In the limit of 
large n, F2 and F2 have the same distribution, F2 ^ F!^- 
Using {z'')^ = r(l + fc)r(l + e)/T{l + k + 9) and 
{{l-zy)^ = 9/{l + e), 



Fo 



(1) 



The statistics of F2 is determined by the moments of F2 



h = {F.^) (2) 



where the average is over random genealogies according 
to the neutral process. The ipk may be calculated nu- 
merically in at least two ways: by generating random 
genealogies [p"oP,pl|] or by evaluating Ewen's sampling 
formula Obtaining an analytical estimate of the (pk 
is complicated by the fact that the allelic frequencies tOa 
in (|^) are not independently distributed. For instance, 
they must satisfy the constraint X^l'Li — ^■ 

To obtain analytical results we seek an approximate 
representation of the neutral process in terms of inde- 
pendent random numbers. When only one locus is of 
interest, non-recombination models apply, irrespective of 
how much gene exchange actually occurs. In this case 
the numbers Ca of allelic types a with given frequency cua 
are approximately independently distributed albeit 
only for sufficiently small cUa- Unfortunately this result 
does not yield the statistics of F2 since all frequencies uja 
enter in (gj), and not just the small cUa- 

In the following we show how the distribution of F2 can 
be determined by means of a recursion for the frequencies 
Ua- assume that there are m allelic types with frequen- 
cies uji,u!2, ■ ■ ■ T^nn obeying the normalisation condition 
^^j^Wq = 1. Add one allelic type; the corresponding 
m-\-l frequencies are defined as follows: draw a fre- 
quency u'^^^ 



malisation, define 
Thus 



Zm with density <I>(zm). To ensure nor- 



(1 - Zm)uJk for k 



1, 



Za-1 II (1 " Zb) 



, m. 



(3) 



where Za (for a > 1) are independent random variables 
with density ^(zg) and zq = 1. For ^{za) = 9{1- Zaf~^ 
it follows from that (for large values of n) the 

frequencies LOa are distributed according to the neutral 
process. 

The recursive definition (^) enables us to derive an ex- 
plicit expression for the moments of F2'. for large n 



Fi^Y.^a' = 4. + {i-zn.rY.^i 



(4) 



V A\ (2(fc-0)!r(2^ + 

' t^Vy T{l + 2k + 9) 
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(5) 



Here and above r(a:) is the Gamma function. Eq. (|^) 
provides an analytical approximation for arbitrary mo- 
ments of i^2j appropriate in the limit of large sample sizes 
n. 

One could reconstruct the distribution function 
P{x) — Prob(F2 — x) of F2 from the moments (||). It is, 
however, more convenient to derive the analogue of eq. 
(0) for P{x) itself. By definition [see eq. (||)], 

P{x)= I Az^{z)P[{x- z'^)/[l~ zf] (6) 

for < X < 1 and zero otherwise. This can be rewritten 
as 



P{x) 



dvQ{x,y)P{y) 



with the kernel 



Q{x,y) 



2a 



Vn-y ) 



H(. 



V) 



(7) 



(8) 



where a = a{x^ y) — x ~ {1 — x)y and II(z) is the Heav- 
iside step function. Note that Q{x,y) exhibits a diver- 
gence as cc,?/ — > 0. Eq. (^ is solved by expanding P{x) 
in a suitable set of basis functions on the interval [0, 1], 
resulting in an eigenvalue problem. Fig. |l| shows the re- 
sulting distributions P{x) for four values of 9. Clearly 
the statistics of F2 is very non-Gaussian. 

The calculations summarised above are not only of in- 
terest in the case of one locus, as the following paragraphs 
show (in the following L denotes the number of loci). 

Two loci. In the case of two loci {L = 2) on the same 
stretch of DNA, the joint distribution of allelic frequen- 
cies ■* depends on the rate C of gene exchange. Con- 
sider (for large n) 



F2^- 



1 = 1 



(9) 



In the limit of large C, the two genealogies for I = 1 and 
I = 2 are essentially independent and the frequencies uja^ 
are well approximated by (^) for each I (and large n). 
The distribution P{x) = Prob(i^2 = x) is thus obtained 
from the single-locus P{x) by convolution. The resulting 
distribution is shown in Fig. ^ Empirically determined 
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recombination rates are often so large that this result for 
P{x) is a good approximation: in Fig. ^ two distribu- 
tions of F2 are shown, for n = 100, = 1/2 and C = 1 
and 10, obtained from Monte-Carlo simulations. One 
observes good agreement with the prediction (shaded), 
even for values of C as low as C = 1. It must be em- 
phasised that the distribution is markedly non-Gaussian. 
The wiggles in the Monte-Carlo results are statistically 
significant; they are a consequence of the finite sample 
size (n = 100). 

Many loci. When L 3> 1, and in the limit of large C, 
the distribution of F2 [as defined in ^] is Gaussian, and 
its moments are obtained as 



1 



29 



1 



{2 + 9){3 + 0)L 



(10) 
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Discussion. In an empirical data set, n (and m) are 
necessarily finite. It must then be asked: to which extent 
are the Za independently and identically distributed for 
finite n (and to)? Fig. ]^{a,) shows Za-values determined 
from empirical data on C. jejuni 

[|l), at the locus Git A 
(n = 194 and to = 27), in comparison with the theory for 
n = 00. The empirical Za are approximately identically 
distributed, except at the edges where finite-size effects 
are observed (remember that zq = 1). Monte-Carlo sim- 
ulations for n = 194 and m = 27 confirm the effect of 
finite sample size. Fig. ^(b) is a similar plot with data 
taken from one Monte-Carlo sample. The inset of Fig. 
^(b) shows that that the Za are indeed independently dis- 
tributed. It can be concluded that the theory works well 
in the present case. 

In the remainder two implications of our results are dis- 
cussed. First, in practice it is necessary to decide whether 
empirically observed frequencies at a given locus are con- 
sistent with the neutral process. The standard statistical 
test (see p. 263) uses the distribution of F2 as an 
input (albeit with the number to of allelic types as a pa- 
rameter and not 9 as in the above equations). Since the 
distribution of F2 was unknown, it was usually deter- 
mined by Monte-Carlo simulations. Now, however, the 
result (0,||) can be used: for to > log n, eqs. (0,||) apply 
independently of whether m or is taken as the param- 
eter. The corresponding distributions are compared to 
Monte-Carlo data ^ in Fig. |l|. Shown are two cases: 
TO = 10, n = 50 and m = 10, n = 500. In both cases, the 
agreement between our results and those of Monte-Carlo 
simulations is very good. 

Second, many recent empirical studies (see for in- 
stance ||l^,|l^,^) have analysed the extent of gene ex- 
change. A common measure is the variance Vd of the 
number of pairwise differences at all loci under consid- 
eration. In the limit of C — > cxd (linkage equilibrium) 



(Ef=i(i- 



■Fr 



2^)F2^) (for the neutral process this 
evaluates to to L9{4: + 9) /[{I + 9){2 + 9){3 + 9)], see '^). 
However for finite values of C (linkage disequilibrium), 
and especially for small C, the expected value of Vb 
is larger. The empirically determined value of Vu can 



be compared to a critical value obtained under the null 
hypothesis that all loci are in linkage equilibrium. The 
corresponding null distribution is usually obtained using 
Monte-Carlo simulations fl^ , !!^ . 

In cases where the neutral model applies, the null 
distribution of Vb can be determined from eqs. (|^), 
(0) and (||). Consider first the case of large i, where 
the null distribution is approximately Gaussian. Using 



i^2'^)F2''' for large C, one obtains 



= L 



YaT[Vu]=L [4 
2e'(1872-420( 



-2413- 
^584 6*2- 



(P2 - [I 
229 6*3- 
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^2)^J (11) 

9^ + 239^+9^) 



(1 + 0)2(2 + 0)2(3 + 0)2(4 + 0)(5 + 9){6 + 9){7 + 9)' 

This variance is always larger than the corresponding 
quantity in a random shuffling scheme [l^Jl^ because 
the latter is conditioned on the homozygosity, and not 
on 9. When L is small, the null distribution will be very 
non-Gaussian, as the above results for the distribution of 
F2 show. In Fig. ^, the null distribution of Vd [as deter- 
mined from (|^,^)] is shown for the case of L = 4 and for 
four values of 9. Note that the forms of the distributions 
imply large, asymmetric confidence intervals. Finally, for 
TO > log n, the distributions in Fig. ^ are insensitive to 
whether the process is conditioned on fixed 9 or fixed k 

il- 

Conclusions. We have shown that distribution func- 
tions of empirical observables measuring genetic diver- 
sity in selectively neutral populations may exhibit strong 
non-Gaussian tails. We have found analytical approxima- 
tions for these distributions, valid for large sample sizes 
and in the limit where gene exchange is frequent; and 
have discussed implications for the statistical analysis of 
genetic variation. It is highly desirable to extend the 
present results to the case where gene exchange is rare, 
corresponding to clonal or nearly clonal populations. 
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FIG. 2. P{x) = PToh{F2 = x) for two loci (L = 2) and 
9 — 0.5, in the limit of large C and n (shaded). Also shown 
are results of Monte-Carlo simulations for n = 100 and C = 10 
(solid line) and C — 1 (dashed line). 




n = 50 




FIG. 1. P{x) = Prob(F2 = a;) for i = 1 and 9 = 0.5, 1, 2 
and 5. Inset: analytical results for P{x) compared to the 
Monte-Carlo results of |||], for m = 10 and n = 50, 500. 



FIG. 3. (a) frequencies Za from empirical uia (locus GltA 
in C. jejuni [M), compared to the neutral model for n ^ oo 
(dashed line). Also shown are results of Monte-Carlo simula- 
tions for finite n = 194 (solid line), (b) is a similar plot with 
data taken from one Monte-Carlo sample. The inset shows 
the correlation strength between Za and zt for n = 194 and 
27 alleles. Black corresponds to full correlation. 




FIG. 4. Null distribution of Vd for L = 4, and 
9 = 0.5, 1, 2 and 5 (in the limit of large C, the range of Vb is 
0<Vo< L/4). 



